Add containers/tei/{cpu,gpu}/1.6.0
#132
Conversation
LGTM!
How does this CPU multi-backend work? Does it check if there are
Yes, it tries to download the ONNX weights first, and otherwise falls back to using
Also @philschmid, see the logs below as a reference for how those look when running on CPU with a model from the Hub without the ONNX-converted weights, e.g. One minor nit within the logs is that it claims to have downloaded the
But that's not true and can be misleading, since it tries to initialize the ONNX backend when the file's not there. cc @OlivierDehaene for reference (happy to open an issue or contribute a fix within the TEI repository if needed!)
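As a rough illustration of the fallback behavior described above, here is a hypothetical sketch (not TEI's actual Rust implementation): prefer the ONNX backend when the model repository ships ONNX weights, and otherwise fall back to another backend. The function name, the file-listing input, and the `"candle"` fallback label are assumptions for illustration.

```python
from pathlib import PurePosixPath

def pick_cpu_backend(repo_files: list[str]) -> str:
    """Hypothetical backend selection: prefer ONNX weights when the
    model repo contains them, otherwise fall back to another backend."""
    if any(PurePosixPath(f).suffix == ".onnx" for f in repo_files):
        return "onnx"
    # No ONNX-converted weights in the repo: fall back (label assumed).
    return "candle"
```

In this sketch a repo with only `model.safetensors` would select the fallback backend, which matches the scenario in the logs above.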
Description
This PR adds a new container for the just-released TEI v1.6.0 (see the release notes at https://github.com/huggingface/text-embeddings-inference/releases/tag/v1.6.0).
The main feature in TEI v1.6.0 compared to TEI v1.5.0 is support for multiple CPU backends, not just ONNX, meaning that it can also serve embedding models on CPU with backends other than ONNX (since not every model on the Hub ships an ONNX-converted version of the weights). Other additions include the General Text Embeddings (GTE) heads, an implementation of MPNet, fixes around the health checks, and much more.
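Once a TEI container is running, it can be queried via its `POST /embed` endpoint, which accepts a JSON body with an `inputs` field. Below is a minimal client sketch; the host and port are assumptions about how the container is exposed locally, and `embed_payload` is a helper name introduced here for illustration.

```python
import json

def embed_payload(texts: list[str]) -> bytes:
    """Build the JSON request body for TEI's /embed endpoint."""
    return json.dumps({"inputs": texts}).encode("utf-8")

if __name__ == "__main__":
    # Assumes a TEI container is reachable on localhost:8080.
    import urllib.request

    req = urllib.request.Request(
        "http://localhost:8080/embed",
        data=embed_payload(["What is deep learning?"]),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        print(json.loads(resp.read()))  # list of embedding vectors
```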
Note
This PR also includes the changes from the https://github.com/huggingface/text-embeddings-inference/releases/tag/v1.5.1 release.
To inspect the changes required to make the TEI container work on GCP, see the diff at: